Efficient Name Disambiguation for Large-Scale Databases
نویسندگان
چکیده
Name disambiguation can occur when one is seeking a list of publications of an author who has used different name variations and when there are multiple other authors with the same name. We present an efficient integrative framework for solving the name disambiguation problem: a blocking method retrieves candidate classes of authors with similar names and a clustering method, DBSCAN, clusters papers by author. The distance metric between papers used in DBSCAN is calculated by an online active selection support vector machine algorithm (LASVM), yielding a simpler model, lower test errors and faster prediction time than a standard SVM. We prove that by recasting transitivity as density reachability in DBSCAN, transitivity is guaranteed for core points. For evaluation, we manually annotated 3,355 papers yielding 490 authors and achieved 90.6% pairwise-F1. For scalability, authors in the entire CiteSeer dataset, over 700,000 papers, were readily disambiguated.
منابع مشابه
Metadata for Name Disambiguation and Collocation
Searching names of persons, families, and organizations is often difficult in online databases because different persons or organizations frequently share the same name and because a single person’s or organization’s name may appear in different forms in various online documents. Databases and search engines can use metadata as a tool to solve the problem of name ambiguity and name variation in...
متن کاملDistortive Effects of Initial-Based Name Disambiguation on Measurements of Large-Scale Coauthorship Networks
Scholars have often relied on name initials to resolve name ambiguities in large-scale coauthorship network research. This approach bears the risk of incorrectly merging or splitting author identities. The use of initial-based disambiguation has been justified by the assumption that such errors would not affect research findings too much. This paper tests this assumption by analyzing coauthorsh...
متن کاملScalable Name Disambiguation using Multi-level Graph Partition
When non-unique values are used as the identifier of entities, due to their homonym, confusion can occur. In particular, when (part of) “names” of entities are used as their identifier, the problem is often referred to as the name disambiguation problem, where goal is to sort out the erroneous entities due to name homonyms (e.g., if only last name is used as the identifier, one cannot distingui...
متن کاملAn Approach to Web-Scale Named-Entity Disambiguation
We present a multi-pass clustering approach to large scale, wide-scope named-entity disambiguation (NED) on collections of web pages. Our approach uses name co-occurrence information to cluster and hence disambiguate entities, and is designed to handle NED on the entire web. We show that on web collections, NED becomes increasingly difficult as the corpus size increases, not only because of the...
متن کاملStructural Semantic Relatedness: A Knowledge-Based Method to Named Entity Disambiguation
Name ambiguity problem has raised urgent demands for efficient, high-quality named entity disambiguation methods. In recent years, the increasing availability of large-scale, rich semantic knowledge sources (such as Wikipedia and WordNet) creates new opportunities to enhance the named entity disambiguation by developing algorithms which can exploit these knowledge sources at best. The problem i...
متن کامل